Analyze the impact of events on the completion of the target event: viewing contacts
The dataset contains event data from the "Unnecessary Things" mobile application, in which users sell their unwanted items by posting them on a bulletin board.
The dataset covers users who performed their first action in the application after October 7, 2019.
Columns in mobile_sources.csv:
Columns in mobile_dataset.csv:
Types of actions:
#
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from datetime import datetime, timedelta
from plotly import graph_objects as go
import os
from plotly.subplots import make_subplots
from scipy import stats as st
import math as mth
pth1 = 'mobile_sources.csv'
pth2 = 'datasets/mobile_sources.csv'
if os.path.exists(pth1):
    sources = pd.read_csv(pth1, sep=',')
elif os.path.exists(pth2):
    sources = pd.read_csv(pth2, sep=',')
else:
    print('Path not found')
pth1 = 'mobile_dataset.csv'
pth2 = 'datasets/mobile_dataset.csv'
if os.path.exists(pth1):
    data = pd.read_csv(pth1, sep=',', parse_dates=['event.time'])
elif os.path.exists(pth2):
    data = pd.read_csv(pth2, sep=',', parse_dates=['event.time'])
else:
    print('Path not found')
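The two loading blocks above repeat the same try-each-path pattern; a small helper (hypothetical name `read_first_existing`, not part of the original notebook) would keep it in one place and fail loudly instead of printing:

```python
import os
import pandas as pd

def read_first_existing(paths, **kwargs):
    """Read the first CSV path that exists; raise if none do."""
    for p in paths:
        if os.path.exists(p):
            return pd.read_csv(p, **kwargs)
    raise FileNotFoundError(f'None of the paths exist: {paths}')
```

With it, loading becomes `sources = read_first_existing(['mobile_sources.csv', 'datasets/mobile_sources.csv'])`.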
#
def info(data):
    # Quick overview: sample rows, dtypes, missing values, duplicates.
    print('------------- First 5 lines ------------')
    display(data.sample(5))
    print('------------- Data types ------------')
    display(data.info())
    print('------------- Gaps ------------')
    for element in data.columns:
        if data[element].isna().sum() > 0:
            print(element, '-', data[element].isna().sum())
        else:
            print(element, '- None')
    print('------------- Duplicates ------------')
    if data.duplicated().sum() > 0:
        print(data.duplicated().sum())
    else:
        print('No Duplicates')
info(sources)
------------- First 5 lines ------------
| | userId | source |
|---|---|---|
| 1254 | c194f363-da79-4019-88ce-4a97466a08bc | yandex |
| 1800 | 09e3e8fa-8a11-41c5-be3a-ffd275fc6f38 | yandex |
| 3198 | 79910090-aaad-418e-88e0-b9ece782c13b | yandex |
| 3165 | c9c58582-ea41-4b0a-925a-a9f9b8eabf30 | yandex |
| 261 | eb89a159-6d63-4959-8059-1fc44c73c6e6 | |
------------- Data types ------------ <class 'pandas.core.frame.DataFrame'> RangeIndex: 4293 entries, 0 to 4292 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 userId 4293 non-null object 1 source 4293 non-null object dtypes: object(2) memory usage: 67.2+ KB
None
------------- Gaps ------------ userId - None source - None ------------- Duplicates ------------ No Duplicates
info(data)
------------- First 5 lines ------------
| | event.time | event.name | user.id |
|---|---|---|---|
| 61231 | 2019-10-30 02:33:13.454263 | search_2 | fdf57c9e-863a-4057-8b4a-782494f7c107 |
| 22471 | 2019-10-16 12:32:07.708267 | search_1 | f261b81c-57d2-4a9c-a4f5-3153e5f9b503 |
| 32298 | 2019-10-20 09:27:29.155047 | tips_show | 779317c3-e226-43bd-887f-2307d441b955 |
| 61518 | 2019-10-30 09:56:42.980757 | map | b6b8815c-fa73-46af-9df2-a733a8515e88 |
| 58353 | 2019-10-29 09:39:21.827112 | tips_show | 558b56e9-fc34-438d-9d12-74cd3d864dfa |
------------- Data types ------------ <class 'pandas.core.frame.DataFrame'> RangeIndex: 74197 entries, 0 to 74196 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 event.time 74197 non-null datetime64[ns] 1 event.name 74197 non-null object 2 user.id 74197 non-null object dtypes: datetime64[ns](1), object(2) memory usage: 1.7+ MB
None
------------- Gaps ------------ event.time - None event.name - None user.id - None ------------- Duplicates ------------ No Duplicates
data = data.rename(columns={'event.time':'event_time', 'event.name':'event_name', 'user.id':'user_id'})
sources = sources.rename(columns={'userId':'user_id'})
display(data.sample(), sources.sample())
data.to_csv('data_clear.csv')
| | event_time | event_name | user_id |
|---|---|---|---|
| 57695 | 2019-10-28 22:20:56.181555 | tips_show | 6d0e0712-85b4-43de-b8d2-4048540a4e9d |
| | user_id | source |
|---|---|---|
| 128 | c88b0187-dd82-4a29-a471-b1df72a9aacb | other |
data['date'] = data['event_time'].dt.date
data['time'] = data['event_time'].dt.time
data.sample()
| | event_time | event_name | user_id | date | time |
|---|---|---|---|---|---|
| 9174 | 2019-10-10 21:31:41.301092 | tips_show | aa3f2890-770b-443f-8a31-40750ed39bb5 | 2019-10-10 | 21:31:41.301092 |
display(f'First date {data["event_time"].min()}')
display(f'Last date {data["event_time"].max()}')
'First date 2019-10-07 00:00:00.431357'
'Last date 2019-11-03 23:58:12.532487'
There are no duplicates or gaps in either dataset.
Column names have been brought to a unified style.
The observation period is nearly a month: 2019-10-07 to 2019-11-03.
display(f'{data["user_id"].nunique()} Unique users ')
'4293 Unique users '
display(f'{data["event_name"].nunique()} Unique events ')
'16 Unique events '
display('Unique users per event', data.groupby("event_name")["user_id"].nunique().sort_values(ascending=False))
'Unique users per event'
event_name tips_show 2801 map 1456 photos_show 1095 contacts_show 979 search_1 787 advert_open 751 search_5 663 search_4 474 favorites_add 351 search_6 330 tips_click 322 search_2 242 contacts_call 213 search_3 208 search_7 157 show_contacts 7 Name: user_id, dtype: int64
Let's fold the 7 "show_contacts" events into "contacts_show", as they appear to be the same action under two names.
data['event_name'] = data['event_name'].replace({'show_contacts':'contacts_show'})
display('Unique users per event', data.groupby("event_name")["user_id"].nunique().sort_values(ascending=False))
'Unique users per event'
event_name tips_show 2801 map 1456 photos_show 1095 contacts_show 981 search_1 787 advert_open 751 search_5 663 search_4 474 favorites_add 351 search_6 330 tips_click 322 search_2 242 contacts_call 213 search_3 208 search_7 157 Name: user_id, dtype: int64
Sessions were defined as follows:
a session is a sequence of events by a single user, and a new session starts whenever the gap between that user's consecutive events exceeds 20 minutes.
sessions = (data.sort_values(['user_id','event_time']).groupby('user_id')['event_time'].diff() > pd.Timedelta('20Min')).cumsum()
data['session_id'] = data.groupby(['user_id', sessions], sort = False).ngroup() + 1
display(data.head(5))
| | event_time | event_name | user_id | date | time | session_id |
|---|---|---|---|---|---|---|
| 0 | 2019-10-07 00:00:00.431357 | advert_open | 020292ab-89bc-4156-9acf-68bc2783f894 | 2019-10-07 | 00:00:00.431357 | 1 |
| 1 | 2019-10-07 00:00:01.236320 | tips_show | 020292ab-89bc-4156-9acf-68bc2783f894 | 2019-10-07 | 00:00:01.236320 | 1 |
| 2 | 2019-10-07 00:00:02.245341 | tips_show | cf7eda61-9349-469f-ac27-e5b6f5ec475c | 2019-10-07 | 00:00:02.245341 | 2 |
| 3 | 2019-10-07 00:00:07.039334 | tips_show | 020292ab-89bc-4156-9acf-68bc2783f894 | 2019-10-07 | 00:00:07.039334 | 1 |
| 4 | 2019-10-07 00:00:56.319813 | advert_open | cf7eda61-9349-469f-ac27-e5b6f5ec475c | 2019-10-07 | 00:00:56.319813 | 2 |
display(f'Out of all {len(data)} events, There are {data.session_id.nunique()} sessions')
'Out of all 74197 events, There are 10975 sessions'
display(f'Out of {data.session_id.nunique()} sessions, There are {(data.groupby("session_id")["event_name"].count() == 1).sum()} sessions consisting of 1 event')
'Out of 10975 sessions, There are 2328 sessions consisting of 1 event'
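The 20-minute rule above can be sanity-checked on a tiny synthetic log (the timestamps below are made up for illustration):

```python
import pandas as pd

# Toy event log for one hypothetical user: gaps of 5 min, 25 min, 1 min.
toy = pd.DataFrame({
    'user_id': ['u1'] * 4,
    'event_time': pd.to_datetime([
        '2019-10-07 10:00', '2019-10-07 10:05',
        '2019-10-07 10:30', '2019-10-07 10:31']),
})

# A new session starts when the gap to the same user's previous event
# exceeds 20 minutes; cumsum turns the break flags into session labels.
gap = toy.sort_values(['user_id', 'event_time']).groupby('user_id')['event_time'].diff()
new_session = (gap > pd.Timedelta('20Min')).cumsum()
toy['session_id'] = toy.groupby(['user_id', new_session], sort=False).ngroup() + 1

print(toy['session_id'].tolist())  # [1, 1, 2, 2]: the 25-minute gap opens a second session
```

The 5-minute and 1-minute gaps stay within a session; only the 25-minute gap starts a new one.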
As our goal is to analyze the impact of preceding events on the target event "contacts_show",
we drop sessions with two or fewer unique events by building the following session filter:
session1 = data.groupby('session_id')['event_name'].nunique().reset_index()
session1 = session1.query('event_name > 2')
len(session1)
2296
Create a new dataset "data2" containing only sessions with more than two unique events.
data2 = data.query('session_id in @session1.session_id')
data2.head(5)
| | event_time | event_name | user_id | date | time | session_id |
|---|---|---|---|---|---|---|
| 0 | 2019-10-07 00:00:00.431357 | advert_open | 020292ab-89bc-4156-9acf-68bc2783f894 | 2019-10-07 | 00:00:00.431357 | 1 |
| 1 | 2019-10-07 00:00:01.236320 | tips_show | 020292ab-89bc-4156-9acf-68bc2783f894 | 2019-10-07 | 00:00:01.236320 | 1 |
| 2 | 2019-10-07 00:00:02.245341 | tips_show | cf7eda61-9349-469f-ac27-e5b6f5ec475c | 2019-10-07 | 00:00:02.245341 | 2 |
| 3 | 2019-10-07 00:00:07.039334 | tips_show | 020292ab-89bc-4156-9acf-68bc2783f894 | 2019-10-07 | 00:00:07.039334 | 1 |
| 4 | 2019-10-07 00:00:56.319813 | advert_open | cf7eda61-9349-469f-ac27-e5b6f5ec475c | 2019-10-07 | 00:00:56.319813 | 2 |
display(f'Data has been reduced to {len(data2)} from {len(data)}')
display(f'Number of sessions has been reduced to {data2["session_id"].nunique()} from {data["session_id"].nunique()}')
'Data has been reduced to 29492 from 74197'
'Number of sessions has been reduced to 2296 from 10975'
10,975 sessions were identified.
A new dataset "data2" was created containing only sessions with more than two unique events. As a result:
the number of events was reduced from 74,197 to 29,492, and
the number of sessions was reduced from 10,975 to 2,296.
#
trace1 = go.Histogram(
    x=data['event_name'],
    opacity=0.75,
    name='Initial Data',
    marker_color='#3A8DF6')
trace2 = go.Histogram(
    x=data2['event_name'],
    opacity=0.75,
    name='Cleared Data',
    marker_color='#92f63a')
graph = [trace1, trace2]
layout = go.Layout(
    title='Frequency of events in initial and cleared datasets',
    barmode='overlay',
    height=900,
    xaxis=dict(
        title='Event name',
        categoryorder='total descending'),
    yaxis=dict(
        title='Quantity of events',
        overlaying='y'))
fig = go.Figure(data=graph, layout=layout)
fig.show()
After filtering, it turned out that the most popular events "tips_show", "photos_show" and "search_1" occur roughly 70% of the time in low-value sessions consisting of only one event.
Deliberate actions that require active input from the user, such as "advert_open", "contacts_show" and "map", shrank much less.
session2 = data2.sort_values(by='event_name').groupby('session_id', as_index=False).agg(event_flow=('event_name', 'unique'))
session2['event_flow'] = [', '.join(map(str, i)) for i in session2['event_flow']]
funnel = session2['event_flow'].value_counts()
display('Top-10 event steps:', funnel.head(10))
'Top-10 event steps:'
advert_open, map, tips_show 300 contacts_show, map, tips_show 176 advert_open, map, search_3, tips_show 86 search_4, search_5, search_6, tips_show 80 contacts_show, photos_show, search_1 75 search_5, search_6, tips_show 75 favorites_add, photos_show, search_1 64 search_4, search_5, tips_show 63 contacts_call, contacts_show, photos_show 62 contacts_call, contacts_show, search_1 59 Name: event_flow, dtype: int64
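As a small illustration of how these flow signatures are built (the events below are synthetic):

```python
import pandas as pd

toy = pd.DataFrame({
    'session_id': [1, 1, 1, 2, 2],
    'event_name': ['map', 'tips_show', 'map', 'search_1', 'contacts_show'],
})

# Unique events per session, alphabetically ordered, joined into one
# signature string per session; value_counts on these gives the funnel.
flows = (toy.sort_values('event_name')
            .groupby('session_id')['event_name']
            .agg(lambda s: ', '.join(s.unique())))
print(flows.tolist())  # ['map, tips_show', 'contacts_show, search_1']
```

Sorting before grouping makes the signature order-independent, so two sessions with the same set of events collapse into one scenario.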
#
fig = make_subplots(rows=1, cols=4)
fig.update_layout(showlegend=False, height=800, width=1000,
                  title='Funnels of Top-4 successful event scenarios in cleared data',
                  title_font_size=20)
session_suc = session2.query('event_flow.str.contains("contacts_show")', engine='python')
funnel_suc = session_suc['event_flow'].value_counts()
for i in range(4):
    # Events of the i-th most frequent successful scenario.
    f = funnel_suc.index[i].split(', ')
    fun = data2.query('event_name == @f').groupby('event_name', as_index=False)['user_id'].nunique().sort_values(by='user_id', ascending=False)
    fig.add_trace(go.Funnel(
        x=fun['user_id'],
        y=fun['event_name'],
        textposition='inside',
        texttemplate='%{label}: %{value:} <br> %{percentPrevious}',
        textangle=-90,
        marker={'color': ['#EBF63A', '#92f63a', '#3A8DF6']}),
        row=1, col=i+1)
fig.update_yaxes(visible=False)
fig.show()
From the funnels we can see that the most common events leading to "contacts_show" are the following:
fig = px.bar(y=funnel[:10].index, x=funnel[:10].values,
             color=funnel[:10].index.str.contains('contacts_show'),
             color_discrete_sequence=['#3A8DF6', '#92f63a'])
fig.update_layout(title='Frequency of events in cleared data',
                  yaxis={'categoryorder': 'total ascending'},
                  yaxis_title='Event flow',
                  xaxis_title='Quantity of sessions',
                  legend_title='Hit Target event')
fig.show()
events_main = ['tips_show', 'map', 'photos_show', 'search_1', 'favorites_add', 'advert_open', 'tips_click']
target = 'contacts_show'
for x in events_main:
    # Sessions containing event x, and the subset of them that also contain the target.
    a = data2.query('event_name == @x')
    b = data2.query('session_id in @a.session_id and event_name == @target')
    au = a.user_id.nunique()
    bu = b.user_id.nunique()
    display(f'Share of users performing "{x}" in the cleared data who also reached "contacts_show":')
    display(f'{ bu/au :.3%}')
'Share of users performing "tips_show" in the cleared data who also reached "contacts_show":'
'25.063%'
'Share of users performing "map" in the cleared data who also reached "contacts_show":'
'26.039%'
'Share of users performing "photos_show" in the cleared data who also reached "contacts_show":'
'66.537%'
'Share of users performing "search_1" in the cleared data who also reached "contacts_show":'
'66.960%'
'Share of users performing "favorites_add" in the cleared data who also reached "contacts_show":'
'42.661%'
'Share of users performing "advert_open" in the cleared data who also reached "contacts_show":'
'16.491%'
'Share of users performing "tips_click" in the cleared data who also reached "contacts_show":'
'37.349%'
"tips_show" and "map" are the most common ways to reach the target event, but they convert to "contacts_show" only about 25% of the time.
Raising the conversion of "tips_show" and "map" should therefore increase the number of "contacts_show" events.
"photos_show" and "search_1" convert to "contacts_show" about 67% of the time, the highest rates observed, so encouraging users to perform them more often is also an option.
"favorites_add" converts at about 43% and is worth promoting as well.
"tips_click" also shows decent conversion, about 37%, but accounts for a small share of event flows.
"advert_open" has the lowest conversion to the target event, about 16%, and is the main candidate for conversion improvements.
#
session = []
for x in events_main:
    # First occurrence of event x and of the target within each successful session.
    session_x = data2.query('session_id in @session_suc.session_id and event_name == @x').groupby('session_id', as_index=False).agg({'event_name': 'first', 'event_time': 'min'})
    session_t = data2.query('session_id in @session_x.session_id and event_name == @target').groupby('session_id', as_index=False)['event_time'].min()
    session_x = session_x.merge(session_t, on='session_id')
    session_x['diff'] = (session_x['event_time_x'] - session_x['event_time_y']).dt.total_seconds().abs()
    display(f'"{x}" and "contacts_show" median time difference:')
    display(f'{ session_x["diff"].median() :.5} seconds')
    session.append(session_x)
session = pd.concat(session)
display(f'Median time difference between prevalent events and target "contacts_show": { session["diff"].median() :.5} seconds')
plt.figure(figsize=(16, 8))
plt.ylim([-1, 3000])
sns.boxplot(x='event_name', y='diff', data=session)
plt.show()
'"tips_show" and "contacts_show" median time difference:'
'278.46 seconds'
'"map" and "contacts_show" median time difference:'
'342.5 seconds'
'"photos_show" and "contacts_show" median time difference:'
'306.74 seconds'
'"search_1" and "contacts_show" median time difference:'
'262.93 seconds'
'"favorites_add" and "contacts_show" median time difference:'
'277.9 seconds'
'"advert_open" and "contacts_show" median time difference:'
'260.23 seconds'
'"tips_click" and "contacts_show" median time difference:'
'404.52 seconds'
'Median time difference between prevalent events and target "contacts_show": 302.63 seconds'
The five most popular session scenarios were extracted:
advert_open, map, tips_show 300
contacts_show, map, tips_show 176
advert_open, map, search_3, tips_show 86
search_4, search_5, search_6, tips_show 80
contacts_show, photos_show, search_1 75
To recap: "tips_show" and "map" are the most common routes to the target event but convert at only about 25%; "photos_show" and "search_1" convert best, at about 67%; "favorites_add" converts at about 43%; "tips_click" converts at about 37% but accounts for a small share of flows; "advert_open" converts worst, at about 16%.
The median time difference between prevalent events and the target "contacts_show" is around 300 seconds (5 minutes).
H0: Conversion to contact views is equal between two groups: users who perform both tips_show and tips_click, and users who perform only tips_show.
H1: Conversion to contact views differs between the two groups.
def z_test(successes1, successes2, trials1, trials2, alpha=0.05):
    # Two-proportion z-test for equality of conversion rates.
    p1 = successes1 / trials1
    p2 = successes2 / trials2
    # Pooled proportion under H0.
    p_combined = (successes1 + successes2) / (trials1 + trials2)
    difference = p1 - p2
    z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))
    distr = st.norm(0, 1)
    # Two-sided p-value.
    p_value = (1 - distr.cdf(abs(z_value))) * 2
    print('p-value: ', "%.20f" % p_value)
    if p_value < alpha:
        display('Reject the null hypothesis, there are statistically significant differences between the samples')
    else:
        display('It was not possible to reject the null hypothesis, there are no statistically significant differences in the samples')
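A quick self-contained sanity check of the same two-proportion z statistic on made-up counts (120/400 vs 90/400; the numbers are illustrative, not from the dataset):

```python
from math import sqrt
from scipy import stats as st

# Hypothetical counts: 120/400 conversions in one group, 90/400 in the other.
successes1, trials1 = 120, 400
successes2, trials2 = 90, 400

p1, p2 = successes1 / trials1, successes2 / trials2
p_comb = (successes1 + successes2) / (trials1 + trials2)  # pooled proportion under H0
z = (p1 - p2) / sqrt(p_comb * (1 - p_comb) * (1 / trials1 + 1 / trials2))
p_value = 2 * (1 - st.norm.cdf(abs(z)))  # two-sided

print(round(z, 3), round(p_value, 3))
```

With these counts the difference (30% vs 22.5%) comes out significant at alpha = 0.05.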
session5 = data.sort_values(by='event_name').groupby('session_id', as_index=False).agg(event_flow=('event_name', 'unique'))
session5['event_flow'] = [', '.join(map(str, i)) for i in session5['event_flow']]
target1 = 'tips_show'
target2 = 'tips_click'
success = 'contacts_show'
x1 = session5.query('event_flow.str.contains(@target1)', engine='python')
x2 = session5.query('event_flow.str.contains(@target1) and event_flow.str.contains(@target2)', engine='python')
trials1 = data.query('session_id in @x1.session_id')['user_id'].nunique()
trials2 = data.query('session_id in @x2.session_id')['user_id'].nunique()
y1 = x1.query('event_flow.str.contains(@success)', engine='python')
y2 = x2.query('event_flow.str.contains(@success)', engine='python')
successes1 = data.query('session_id in @y1.session_id')['user_id'].nunique()
successes2 = data.query('session_id in @y2.session_id')['user_id'].nunique()
z_test(successes1, successes2, trials1, trials2, alpha=0.05)
p-value: 0.15008551289817906316
'It was not possible to reject the null hypothesis, there are no statistically significant differences in the samples'
H0: Conversion to contact views is equal between the group of users who performed favorites_add and those who did not.
H1: Conversion to contact views differs between the two groups.
target = 'favorites_add'
success = 'contacts_show'
x1 = session5.query('event_flow.str.contains(@target)', engine='python')
x2 = session5.query('~event_flow.str.contains(@target)', engine='python')
trials1 = data.query('session_id in @x1.session_id')['user_id'].nunique()
trials2 = data.query('session_id in @x2.session_id')['user_id'].nunique()
y1 = x1.query('event_flow.str.contains(@success)', engine='python')
y2 = x2.query('event_flow.str.contains(@success)', engine='python')
successes1 = data.query('session_id in @y1.session_id')['user_id'].nunique()
successes2 = data.query('session_id in @y2.session_id')['user_id'].nunique()
z_test(successes1, successes2, trials1, trials2, alpha=0.05)
p-value: 0.01336115333748821854
'Reject the null hypothesis, there are statistically significant differences between the samples'
H0: Conversion to contact views is equal between the group of users who performed photos_show and those who did not.
H1: Conversion to contact views differs between the two groups.
target = 'photos_show'
success = 'contacts_show'
x1 = session5.query('event_flow.str.contains(@target)', engine='python')
x2 = session5.query('~event_flow.str.contains(@target)', engine='python')
trials1 = data.query('session_id in @x1.session_id')['user_id'].nunique()
trials2 = data.query('session_id in @x2.session_id')['user_id'].nunique()
y1 = x1.query('event_flow.str.contains(@success)', engine='python')
y2 = x2.query('event_flow.str.contains(@success)', engine='python')
successes1 = data.query('session_id in @y1.session_id')['user_id'].nunique()
successes2 = data.query('session_id in @y2.session_id')['user_id'].nunique()
z_test(successes1, successes2, trials1, trials2, alpha=0.05)
p-value: 0.69348793142563436298
'It was not possible to reject the null hypothesis, there are no statistically significant differences in the samples'
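Since three hypotheses are tested on the same data, a multiple-comparison correction is worth checking; a minimal sketch using the Bonferroni adjustment (p-values copied from the outputs above):

```python
# Bonferroni: compare each p-value against alpha divided by the number of tests.
alpha = 0.05
p_values = [0.1501, 0.0134, 0.6935]  # tips_click, favorites_add, photos_show tests
adjusted_alpha = alpha / len(p_values)  # 0.05 / 3 ~ 0.0167
significant = [p < adjusted_alpha for p in p_values]
print(significant)  # [False, True, False]
```

The favorites_add result survives the correction, so the conclusions below are unchanged.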
There is no statistically significant difference in conversion to "contacts_show" between users who perform both tips_show and tips_click and those who perform only tips_show.
There is a statistically significant difference in conversion to "contacts_show" between users who performed favorites_add and those who did not.
There is no statistically significant difference in conversion to "contacts_show" between users who performed photos_show and those who did not.
10,975 sessions were identified and reduced to 2,296 after removing sessions with two or fewer unique events.
The most common successful event-flow scenarios are listed above.
"tips_show" and "map" are the most common ways to reach the target event, but they convert to "contacts_show" only about 25% of the time.
"photos_show" and "search_1" convert to "contacts_show" about 67% of the time.
"favorites_add" converts at about 43%.
"tips_click" also shows decent conversion, about 37%, but accounts for a small share of event flows.
"advert_open" has the lowest conversion to the target event, about 16%.
The median time difference between prevalent events and the target "contacts_show" is around 5 minutes.
There is a statistically significant difference in conversion to "contacts_show" between users who performed favorites_add and those who did not.
There is no statistically significant difference in conversion to "contacts_show" between users who perform both tips_show and tips_click and those who perform only tips_show.
There is no statistically significant difference in conversion to "contacts_show" between users who performed photos_show and those who did not.
In the order of impact: